Chocolate Bars Research

Background

We have been chosen by an Israeli food company to analyze worldwide chocolate bar and cocoa data. The purpose of our research is to reach a conclusion on which chocolate bars are best for them to import in order to stay competitive in the premium chocolate market in Israel. Our goal is to base our recommendations using different elements that make up chocolate bars, such as cocoa bean origin and number of ingredients. These options do not necessarily have to be similar, but their quality has to be the best of the best.

Warning: this research will make you want to consume chocolate!

Introduction

In this research we will focus on a chocolate bar data set. This dataset comes from the 2022 tidyTuesday data. The original source is “Flavors of Cacao” and was found in an article from Will Canniford on Kaggle.

Before diving into the data, a bit of an introduction to chocolate. Chocolate comes from cocoa beans, usually from South America. In relevance to our data, the percentage of cocoa used in each chocolate bar is displayed. Also, additional ingredients are used such as salt,cocoa butter, and sugar. Lastly, each chocolate bar in the data was given a rating from 1-5.

Goals:

  1. Tidy the dataset

  2. Visualize the dataset

  3. Analyze the dataset using methods learned throughout our course.

We will examine the relationship between different variables of chocolate, such as cocoa percent, rating, and the origin of the bean. In addition, we will analyze the correlation between origin of the cocoa bean, the country the bar was manufactured in and the rating that the chocolate bar received.

Part 1 - Import and Tidying Data

First, we need to import the data.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
chocolate <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')
## Rows: 2530 Columns: 10
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): company_manufacturer, company_location, country_of_bean_origin, spe...
## dbl (3): ref, review_date, rating
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s take a brief look at our data, using glimpse().

glimpse(chocolate)
## Rows: 2,530
## Columns: 10
## $ ref                              <dbl> 2454, 2458, 2454, 2542, 2546, 2546, 2~
## $ company_manufacturer             <chr> "5150", "5150", "5150", "5150", "5150~
## $ company_location                 <chr> "U.S.A.", "U.S.A.", "U.S.A.", "U.S.A.~
## $ review_date                      <dbl> 2019, 2019, 2019, 2021, 2021, 2021, 2~
## $ country_of_bean_origin           <chr> "Tanzania", "Dominican Republic", "Ma~
## $ specific_bean_origin_or_bar_name <chr> "Kokoa Kamili, batch 1", "Zorzal, bat~
## $ cocoa_percent                    <chr> "76%", "76%", "76%", "68%", "72%", "8~
## $ ingredients                      <chr> "3- B,S,C", "3- B,S,C", "3- B,S,C", "~
## $ most_memorable_characteristics   <chr> "rich cocoa, fatty, bready", "cocoa, ~
## $ rating                           <dbl> 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3~

The following table explains each data type:

variable class description
ref integer Reference ID, The highest REF numbers were the last entries made.
company_manufacturer character Manufacturer name
company_location character Manufacturer region
review_date integer Review date (year)
country_of_bean_origin character Country of origin
specific_bean_origin_or_bar_name character Specific bean or bar name
cocoa_percent character Cocoa percent (% chocolate)
ingredients character Ingredients, (“#” = represents the number of ingredients in the chocolate; B = Beans, S = Sugar, S* = Sweetener other than white cane or beet sugar, C = Cocoa Butter, V = Vanilla, L = Lecithin, Sa = Salt)
most_memorable_characteristics character Most Memorable Characteristics column is a summary review of the most memorable characteristics of that bar. Terms generally relate to anything from texture, flavor, overall opinion, etc. separated by ‘,’
rating double rating between 1-5

In order to work with the data properly, we need to make a few adjustments.

  1. Adjust the chocolate percentage value into an integer value for calculations later

  2. Organize the data by manufacturing country and origin of cocoa bean.

  3. Organize the ingredients column, separating into a column for integer values based on number of ingredients, as well as boolean values based on ingredients in the bar itself.

  4. Lastly, we analyze the data using the ref values to isolate the top median. The higher ref values represent the newer data.

chocolate_new <- chocolate %>% 
  mutate(cocoa_percent = str_remove(cocoa_percent, "%")) %>% 
  mutate(cocoa_percent = as.numeric(cocoa_percent))

This code turns the chocolate percentage value into an integer to be easily analyzed later.

count(chocolate_new,company_location)
## # A tibble: 67 x 2
##    company_location     n
##    <chr>            <int>
##  1 Amsterdam           12
##  2 Argentina            9
##  3 Australia           53
##  4 Austria             30
##  5 Belgium             63
##  6 Bolivia              2
##  7 Brazil              25
##  8 Canada             177
##  9 Chile                2
## 10 Colombia            29
## # ... with 57 more rows
count(chocolate_new,country_of_bean_origin)
## # A tibble: 62 x 2
##    country_of_bean_origin     n
##    <chr>                  <int>
##  1 Australia                  3
##  2 Belize                    76
##  3 Blend                    156
##  4 Bolivia                   80
##  5 Brazil                    78
##  6 Burma                      1
##  7 Cameroon                   3
##  8 China                      1
##  9 Colombia                  79
## 10 Congo                     11
## # ... with 52 more rows

Here we count how many different countries exist in our data, and then proceed to enter them in lists categorized by continents. Our goal here is to isolate the data by continents in order to easily and cleanly visualize our chocolate bar data. Due to their being over 60 different countries, this data is not clearly visualized.

North_America<-c("Canada","Martinique","St. Lucia","St.Vincent-Grenadines","U.S.A.","Grenada","Haiti","Jamaica","Tobago","Trinidad","Mexico")
South_America<-c("Argentina","Bolivia","Brazil","Chile","Colombia","Costa Rica","Dominican Republic","Ecuador","El Salvador","Fiji","Guatemala","Honduras","Peru","Puerto Rico","Suriname","Venezuela","Belize","Bolivia","Brazil","Cuba","Nicaragua","Panama","Papua New Guinea")
Europe<-c("Amsterdam","Austria","Belgium","Czech Republic","Denmark","Finland","France","Germany","Hungary","Iceland","Ireland","Israel","Italy","Lithuania","Netherlands","Norway","Poland","Portugal","Scotland","Spain","Sweden","Switzerland","U.K.","Wales")
Africa<-c("Ghana","Madagascar","Nicaragua","Sao Tome","Sao Tome & Principe","South Africa","Cameroon","Congo","DR Congo","Gabon","Ivory Coast","Liberia","Principe","Sierra Leone","Tanzania","Togo","Uganda","Nigeria")
Asia<-c("India","Japan","Russia","Singapore","South Korea","Taiwan","Thailand","U.A.E.","Vietnam","Burma","China","Indonesia","Malaysia","Sri Lanka","Sulawesi","Sumatra","Philippines")
Australia_pacific<-c("Australia","Fiji","New Zealand","Vanuatu","Samoa","Solomon Islands","Vanuatu")

chocolate_new['company_continents'] <- NA
chocolate_new['cocoa_origin_continents'] <- NA

for(i in 1:nrow(chocolate_new))
{

  if(is.element(chocolate_new$company_location[i], North_America))
  {
    
    chocolate_new$company_continents[i] <- "North_America"
    
  }
  else if(is.element(chocolate_new$company_location[i], South_America)) 
  {
   
    chocolate_new$company_continents[i] <- "South_America"
  }
  else if(is.element(chocolate_new$company_location[i], Europe)) 
  {
   
    chocolate_new$company_continents[i] <- "Europe"
  }
  else if(is.element(chocolate_new$company_location[i], Africa)) 
  {
   
    chocolate_new$company_continents[i] <- "Africa"
  }
  else if(is.element(chocolate_new$company_location[i], Asia)) 
  {
    chocolate_new$company_continents[i] <- "Asia"
  }
  else if(is.element(chocolate_new$company_location[i], Australia_pacific)) 
  {
    chocolate_new$company_continents[i] <- "Australia_pacific"
  }
  else  
  {
    chocolate_new$cocoa_origin_continents[i] <- "Blend"
  }
  
}
for(i in 1:nrow(chocolate_new))
{

  if(is.element(chocolate_new$country_of_bean_origin[i], North_America))
  {
    
    chocolate_new$cocoa_origin_continents[i] <- "North_America"
    
  }
  else if(is.element(chocolate_new$country_of_bean_origin[i], South_America)) 
  {
   
    chocolate_new$cocoa_origin_continents[i] <- "South_America"
  }
  else if(is.element(chocolate_new$country_of_bean_origin[i], Europe)) 
  {
   
    chocolate_new$cocoa_origin_continents[i] <- "Europe"
  }
  else if(is.element(chocolate_new$country_of_bean_origin[i], Africa)) 
  {
   
    chocolate_new$cocoa_origin_continents[i] <- "Africa"
  }
  else if(is.element(chocolate_new$country_of_bean_origin[i], Asia)) 
  {
    chocolate_new$cocoa_origin_continents[i] <- "Asia"
  }
  else if(is.element(chocolate_new$country_of_bean_origin[i], Australia_pacific)) 
  {
    chocolate_new$cocoa_origin_continents[i] <- "Australia_pacific"
  }
  else 
  {
    chocolate_new$cocoa_origin_continents[i] <- "Blend"
  }
  
}

Here, we created a list for each continent in which we entered each country that appeared in our count() functions before. Using two for loops and if conditions, we created two new columns that show what continent the company_location is in as well as the cocoa_bean_origin_location. These two columns can now be used to easily visualize our data based on continent values.

#6-B,S,C,V,L,Sa
# B = Beans, S = Sugar, S* = Sweetener other than white cane or beet sugar, C = Cocoa Butter, V = Vanilla, L = Lecithin, Sa = Salt)

chocolate_new[is.na(chocolate_new)] = "0"
chocolate_new['num_of_ingredients'] <- NA 
chocolate_new['Beans'] <- NA 
chocolate_new['Sugar'] <- NA 
chocolate_new['Sweetener'] <- NA 
chocolate_new['Cocoa_Butter'] <- NA 
chocolate_new['Vanilla'] <- NA 
chocolate_new['Lecithin'] <- NA 
chocolate_new['Salt'] <- NA 


chocolate_new <- chocolate_new %>%
  mutate(Sweetener = (str_detect(ingredients, fixed("S")) & str_detect(ingredients, '\\*'))) %>%
  
  mutate(Salt = (str_detect(ingredients,fixed("S")) & str_detect(ingredients,fixed("a")))) %>%
  
  mutate(Sugar = (str_detect(ingredients, 'S') & str_detect(ingredients,'\\*', negate = TRUE) & str_detect(ingredients,'a', negate =  TRUE)))  %>%
  
  mutate(Beans = str_detect(ingredients, "B")) %>%
  
  mutate(Cocoa_Butter = str_detect(ingredients, "C")) %>%
  
  mutate(Vanilla = str_detect(ingredients, "V")) %>%
  
  mutate(Lecithin = str_detect(ingredients, "L"))   %>%
  
  mutate(num_of_ingredients = as.numeric(gsub("([0-9]+).*$", "\\1",ingredients)))

In order to separate each ingredient that a certain chocolate bar contains, we created binary columns for each ingredient. In addition, we created a new column that separated the total number of ingredients in that bar.

hypothesis_two <- chocolate_new
hypothesis_three <- chocolate_new
regression <- chocolate_new
#copy of table before changes to be used in hypothesis tests and regression later

median_ref = median(chocolate_new$ref)
chocolate_new <- chocolate_new %>%
  filter(ref >= median_ref)

chocolate_final = subset(chocolate_new, select = -c(ref,review_date))

Since ref indicates recently updated data, we organized our data based on samples higher than the median of ref.

In addition, chocolate_final contains our tidy data without irrelevant columns such as ref and review_date.

Part 2 - Visualization

In order to proceed with researching the top chocolate bars , we will need to properly visualize our data. We will start with the larger picture, using data per continent. We will narrow down our results to the top continents, then to countries. This will allow us to “zoom in” in order to reach the best chocolate bars.

hypothesis_one <- chocolate_final
#copy of table before changes to be used in hypothesis tests later
count_manufacturer <- count(chocolate_final,company_continents)

p1 <- ggplot(chocolate_final, aes(company_continents,rating))
p1 +  geom_boxplot(fill = "Red") + geom_text(data = count_manufacturer, aes(y = 1.5, label = n))

(1) Using a Box plot we visualize manufacturer rating by continent.

Here we can conclude that although Australia seems to be the best continent, there are less chocolate bars manufactured there. (as seen by the count function) The Box Plot appears as if all continents(except Africa) seem to be exactly the same, but it is important to note that North America has more samples than the rest. We can clearly see this from counting the samples per continent. Therefore, we will choose to use chocolate manufactured in North America as this data should be more accurate. This theory will be further analyzed using hypothesis tests(using the saved table hypothesis_one).

count_origin <- count(chocolate_final,cocoa_origin_continents)

p2 <- ggplot(chocolate_final, aes(cocoa_origin_continents, rating))
p2 +  geom_boxplot(fill = "Red") + geom_text(data = count_origin, aes(y = 1.5, label = n))

(2) Using a Box plot we created a visualization of ratings per chocolate bar by their cocoa bean origin continent.

For the same reason that we chose North America before, we can easily see that the top choice here is South America. Our count function proves this choice as most of the cocoa used in chocolate bars worldwide comes from forests in South America.

In these two Box Plots, the medians of all continents are the same(except for Africa). Therefore, we used the count function to choose continents by sample size.

chocolate_final <- chocolate_final %>%
  filter(grepl('South_America', cocoa_origin_continents))%>%
  filter(grepl('North_America', company_continents))

p3 <- ggplot(chocolate_final, aes(company_location, rating))
p3 +  geom_boxplot(fill ="Red") 

Here, we narrowed our results down to chocolate in which the cocoa was harvested in South America and manufactured in North America. This yields chocolate manufactured in Canada and U.S.A.. We can see that the results are very similar, therefore we will continue to use data from both of the countries.

After choosing which countries to use, we want to analyze the optimal number of ingredients for a high rating.

p4 <- ggplot(chocolate_final, aes(x = num_of_ingredients, fill = factor(rating), y = 1))
p4 +  geom_col(position = "fill") + scale_fill_brewer(palette = "RdYlGn") + coord_flip()  +labs(title= "Plot of rating by number of ingredients", x = "Number of Ingredients",y = "Scale", fill = "Rating")

count(chocolate_final, num_of_ingredients)
## # A tibble: 5 x 2
##   num_of_ingredients     n
##                <dbl> <int>
## 1                  1     1
## 2                  2   210
## 3                  3   224
## 4                  4    33
## 5                  5     4

After inspecting the rating’s on each bar by number of ingredients, and counting them by category (To see if we have enough data), we can see that 2 and 3 ingredients yield the highest rated chocolate bars.

We will check this hypothesis later: 2-3 ingredients bars giving the best rating.

Now we can filter our data, and finally choose the best 5 chocolate bars from American or Canadian manufacturers that use cacao from South America and use 2-3 ingredients in their chocolate bars.

chocolate_final <- chocolate_final %>%
  select(company_manufacturer, specific_bean_origin_or_bar_name,country_of_bean_origin, cocoa_percent, rating, num_of_ingredients, Beans, Sugar, Sweetener, Cocoa_Butter, Vanilla, Lecithin, Salt) %>%
           filter(num_of_ingredients %in% (2:3))


count(chocolate_final,Beans)
## # A tibble: 1 x 2
##   Beans     n
##   <lgl> <int>
## 1 TRUE    434
count(chocolate_final,Sugar)
## # A tibble: 2 x 2
##   Sugar     n
##   <lgl> <int>
## 1 FALSE    10
## 2 TRUE    424
count(chocolate_final,Sweetener)
## # A tibble: 2 x 2
##   Sweetener     n
##   <lgl>     <int>
## 1 FALSE       424
## 2 TRUE         10
count(chocolate_final,Cocoa_Butter)
## # A tibble: 2 x 2
##   Cocoa_Butter     n
##   <lgl>        <int>
## 1 FALSE          210
## 2 TRUE           224
count(chocolate_final,Vanilla)
## # A tibble: 1 x 2
##   Vanilla     n
##   <lgl>   <int>
## 1 FALSE     434
count(chocolate_final,Lecithin)
## # A tibble: 1 x 2
##   Lecithin     n
##   <lgl>    <int>
## 1 FALSE      434
count(chocolate_final,Salt)
## # A tibble: 1 x 2
##   Salt      n
##   <lgl> <int>
## 1 FALSE   434
chocolate_choices <- chocolate_final %>%
  select(company_manufacturer, specific_bean_origin_or_bar_name,country_of_bean_origin, cocoa_percent, rating, num_of_ingredients,Beans, Sugar, Sweetener, Cocoa_Butter) %>%
           filter(num_of_ingredients %in% (2:3))  %>%
           filter(rating >= 4)

After filtering out the data, we used the count function to check the most popular ingredients. We observed that all bars have Beans, Sugar or Sweetener, and none of them have Vanilla, Lecithin, or Salt. We can differentiate between the chocolate based on cocoa percentages, sugar , and cocoa butter.

After this analysis, we created a new table chocolate_choices, that also only includes ratings above or equal to 4. We also observed that these bars with ratings equal to 4 have no sweetener and only Sugar, so this variable becomes irrelevant.

Most of our options have 70% cocoa, therefore these are our most popular choices

To conclude, the 5 top chocolate bars (Manufactured in Canada/USA, 2-3 ingredients,cocoa from South America) that we will further analyze are:

  1. One that has 70% Cocoa, Sugar, and without Cocoa Butter:
first_choice <- chocolate_choices %>%
            filter(cocoa_percent == 70) %>%
            filter(Sugar == TRUE)%>%
            filter(Cocoa_Butter == FALSE)
  1. One that has 70% Cocoa, Sugar, and with Cocoa Butter:
second_choice <- chocolate_choices %>%
            filter(cocoa_percent == 70) %>%
            filter(Sugar == TRUE)%>%
            filter(Cocoa_Butter == TRUE)
  1. One that has over than 70% Cocoa, Sugar and without Cocoa Butter:
third_choice <- chocolate_choices %>%
            filter(cocoa_percent > 70) %>%
            filter(Sugar == TRUE)%>%
            filter(Cocoa_Butter == FALSE)
  1. One that has over than 70% Cocoa, Sugar and with Cocoa Butter:
fourth_choice <- chocolate_choices %>%
            filter(cocoa_percent > 70) %>%
            filter(Sugar == TRUE)%>%
            filter(Cocoa_Butter == TRUE)
  1. One that has less than 70% Cocoa (There is only one in the observed data):
fifth_choice <- chocolate_choices %>%
            filter(cocoa_percent < 70)

We can see that most of the chocolate that has 70% cacao has a high rating. We want to further test this using a double sided hypothesis test over all our data.

Part 3.1 - Modeling with Hypothesis Tests

We came to some conclusions through observing visualizations, and obviously this is not completely accurate.

We can prove (or disprove) these conclusions using models learned in class.

  1. We chose chocolate manufactured in North America since we believe that the data is more accurate, especially since the sample size is larger even when Australia seems to have more chocolate with a rating above 3.25 (median of box plot).

  2. We chose to focus on chocolate with 2-3 ingredients as we believe that their rating is the highest. This hypothesis was composed from the plot “rating by number of ingredients”

  3. We assumed that chocolate bars with 70 percent cacao are the most popular, but we want to prove this with a hypothesis test. By popular, we are referring to the percentage of values in the total sample.

First, we will test that the mean difference (wherever the rating is at least 3.25) between North America and Australia is equal to 0.

Aus <- hypothesis_one %>%
  filter(rating >= 3.25)  %>%
  filter( company_continents == "Australia_pacific") %>%
  select(rating)

North <- hypothesis_one %>%
  filter(rating >= 3.25)  %>%
  filter( company_continents == "North_America") %>%
  select(rating)

Difference in Means Test

H0: Mean(North_America) - Mean(Australia_pacific) = 0

H1: Mean(North_America) - Mean(Australia_pacific) > 0

(while ratings >= 3.25)

t.test(x = North$rating,y = Aus$rating,alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  North$rating and Aus$rating
## t = -1.248, df = 38.994, p-value = 0.8903
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.1184045        Inf
## sample estimates:
## mean of x mean of y 
##  3.492473  3.542857

Conclusion:

We reject H0 and accept H1: meaning our choice to use data from North America was correct. The mean rating from North America comes out is greater than Australia meaning that the difference in means is greater than 0 at a confidence level of 0.95.

Difference in means - Test 2:

H0: the ratings of chocolate bars with 2-3 ingredients is the same as the rest

H1: The ratings of chocolate bars with 2-3 ingredients is higher than those with 0 or 1 or at least 4

Before testing we need to make sure that the prerequisites to the central limit theorem exist.

two_three_ingredients <- hypothesis_two %>%
  filter(num_of_ingredients == 2 | num_of_ingredients == 3) 
  
count(two_three_ingredients,rating)
## # A tibble: 11 x 2
##    rating     n
##     <dbl> <int>
##  1   1.5      2
##  2   1.75     1
##  3   2        4
##  4   2.25    11
##  5   2.5     93
##  6   2.75   226
##  7   3      334
##  8   3.25   365
##  9   3.5    428
## 10   3.75   225
## 11   4       84
rest_ingredients <- hypothesis_two %>%
  filter(num_of_ingredients == 0 | num_of_ingredients == 1 | num_of_ingredients == 4| num_of_ingredients == 5| num_of_ingredients == 6) 

count(rest_ingredients,rating)
## # A tibble: 12 x 2
##    rating     n
##     <dbl> <int>
##  1   1        4
##  2   1.5      8
##  3   1.75     2
##  4   2       29
##  5   2.25     6
##  6   2.5     73
##  7   2.75   107
##  8   3      189
##  9   3.25    99
## 10   3.5    137
## 11   3.75    75
## 12   4       28
#both counts come out to over 30, meaning we can use the central limit theorem.

t.test(x = two_three_ingredients$rating,y = rest_ingredients$rating,alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  two_three_ingredients$rating and rest_ingredients$rating
## t = 8.1095, df = 1165.5, p-value = 6.385e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.1357884       Inf
## sample estimates:
## mean of x mean of y 
##  3.247321  3.076948

Conclusion: We reject H0 and accept H1: meaning our choice to use data from with 2-3 ingredients was correct. The mean rating from chocolate bars with 2-3 ingredients comes out to be greater than those with more or less ingredients. This shows that the difference in means is greater than 0 at a confidence level of 0.95.

This test was successful, but not completely accurate as it used groups that fit our needs. In our regression tests, we will test each number of ingredients and their affect on rating.

Hypothesis test 3:

H0: proportion of Chocolate Bars with 70% cocoa is 33% (compared with less than 70% cocoa and more than 70% cocoa)

H1: proportion of Chocolate Bars with 70% cocoa is more than 33% (More popular than the others categories)

Percent = c("Seventy","More","Less")
Freq = c(sum(70 == hypothesis_three$cocoa_percent), sum(70 > hypothesis_three$cocoa_percent), sum(70 < hypothesis_three$cocoa_percent))
cocoa_table = data.frame(Percent, Freq)
prop.test(x = cocoa_table$Freq[1], n = sum(Freq),p = 0.3333 ,alternative = "greater")
## 
##  1-sample proportions test with continuity correction
## 
## data:  cocoa_table$Freq[1] out of sum(Freq), null probability 0.3333
## X-squared = 72.76, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is greater than 0.3333
## 95 percent confidence interval:
##  0.3972394 1.0000000
## sample estimates:
##         p 
## 0.4134387

Conclusion: We see that the sample estimates that p>0.3333 and that we should not accept H0, therefore we accept H1 that shows that our assumption choosing 70% percent cocoa as indication was correct with 95% confidence level.

To test how cocoa percent influence the rating of chocolate bars we will use the regression modeling later on.

Part 3.2 - Modeling With Linear Regression

In this part we will check how the rating of a chocolate bar is influenced by 2 variables:

  1. The number of ingredients.

  2. The cocoa percent.

Regression Test

We want to test if their is a linear regression between the rating of a chocolate bar and its percentage of cocoa. We added ingredients into the formula to check each ingredient individually in order to test the cacao percent on its rating. Before we reach any conclusions, we need to make sure the residuals distribute normally and are homoscedastic.

model <- lm(rating ~ cocoa_percent + num_of_ingredients + Beans + Sugar + Sweetener + Vanilla + Cocoa_Butter + Lecithin + Salt, data=regression)

ggplot(model,aes(x = cocoa_percent,y = rating, color = factor(Beans + Sugar + Sweetener + Vanilla + Cocoa_Butter + Lecithin + Salt))) + geom_point() + stat_smooth(formula = y ~ x, method = "lm") + scale_colour_manual( name = "Ingredient", values = c("1" = "grey", "2" = "pink", "3" = "blue", "4" = "brown", "5" = "red", "6" = "black"),labels = c("1"="Sugar", "2"="Sweetener", "3"="Vanilla", "4"="Cocoa_Butter", "5"="Lecithin", "6"="Salt")) 

#Checking if the residuals distributing normally: 
plot(model,2)

#Checking Homoscedasticity: 
plot(model,1)

We can see that the residuals do not distribute normally. In a normal research we would stop here. In addition, we can see that the variables are heteroscedastic.

Nonetheless, we continue as if the distribution is normal.

summary(model)
## 
## Call:
## lm(formula = rating ~ cocoa_percent + num_of_ingredients + Beans + 
##     Sugar + Sweetener + Vanilla + Cocoa_Butter + Lecithin + Salt, 
##     data = regression)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.98845 -0.27716  0.00201  0.26893  1.12265 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.706763   0.129733  28.572  < 2e-16 ***
## cocoa_percent      -0.011849   0.001605  -7.385 2.07e-13 ***
## num_of_ingredients -0.260121   0.226772  -1.147   0.2515    
## BeansTRUE           0.696471   0.369207   1.886   0.0594 .  
## SugarTRUE           0.194408   0.154431   1.259   0.2082    
## SweetenerTRUE      -0.032835   0.142622  -0.230   0.8179    
## VanillaTRUE         0.041580   0.230155   0.181   0.8566    
## Cocoa_ButterTRUE    0.302445   0.227547   1.329   0.1839    
## LecithinTRUE        0.212346   0.227814   0.932   0.3514    
## SaltTRUE            0.280708   0.295711   0.949   0.3426    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.427 on 2520 degrees of freedom
## Multiple R-squared:  0.08401,    Adjusted R-squared:  0.08074 
## F-statistic: 25.68 on 9 and 2520 DF,  p-value: < 2.2e-16

First, we build our regression model. After graphing, we can see that the spread of our points is very wide, therefore we believe that our model is weak. As noted, the residuals do not distribute normally.

We then proceed to summarize our model using rating,cocoa percent, and each ingredient separately(in binary form). We can see that R2 is very small, therefore our model only describes a very small portion of our errors. Regarding cocoa percent, we expected that the higher the percentage, the higher the rating(most of the chocolate in our data is around 70% cocoa). The connection is negative, although it is very close to zero showing that the effect is very minimal. It is important to note here that the regression is significant.

Nonetheless, as expected, the lower the number of ingredients, the higher the rating. The connection is negative, but not significant(as expected). 2-3 ingredients was shown to have higher ratings than those with 0 or 1 ingredient.

In addition, there is a positive connection between adding specific ingredients(except sweetener):

Beans: The only ingredient which is significant. Exists in all chocolate bars that have a rating higher than 3.25, as shown before. The connection here is positive.

Sugar: Not significant, although we chose to use data with sugar, as all chocolate with a rating above 4 has sugar and not sweetener.

Sweetener: The only negative connection, although minimal. Sweetener and sugar are substitutes, so this connection is logical as sweetener is positive.

Vanilla: Positive but very close to zero and not significant. As expected, this ingredient barely affects rating.

Cocoa Butter: While found in most bars with a rating over 4, we would expect that the connection here would be significant.

Lecithin,Salt: Positive connection, but we expected that the effect on rating for each one would be weaker. Not significant

It is important to note that this model in regards to ingredients (except for beans) is not significant, therefore not completely accurate.

From the F-statistic test, we can say that we deny H0 , and accept H1, meaning the variables are dependent.

(PV<0.05 -> Bi0)

As bonus, we want to check if we can improve the model by removing variables using ‘backward’ steps:

model_backwards<- step(model, direction = "backward")
## Start:  AIC=-4296.35
## rating ~ cocoa_percent + num_of_ingredients + Beans + Sugar + 
##     Sweetener + Vanilla + Cocoa_Butter + Lecithin + Salt
## 
##                      Df Sum of Sq    RSS     AIC
## - Vanilla             1    0.0059 459.40 -4298.3
## - Sweetener           1    0.0097 459.40 -4298.3
## - Lecithin            1    0.1584 459.55 -4297.5
## - Salt                1    0.1643 459.56 -4297.4
## - num_of_ingredients  1    0.2399 459.63 -4297.0
## - Sugar               1    0.2889 459.68 -4296.8
## - Cocoa_Butter        1    0.3221 459.72 -4296.6
## <none>                            459.39 -4296.3
## - Beans               1    0.6487 460.04 -4294.8
## - cocoa_percent       1    9.9414 469.34 -4244.2
## 
## Step:  AIC=-4298.32
## rating ~ cocoa_percent + num_of_ingredients + Beans + Sugar + 
##     Sweetener + Cocoa_Butter + Lecithin + Salt
## 
##                      Df Sum of Sq    RSS     AIC
## - Sweetener           1    0.0356 459.44 -4300.1
## <none>                            459.40 -4298.3
## - Sugar               1    0.4328 459.83 -4297.9
## - Salt                1    0.9388 460.34 -4295.2
## - Lecithin            1    3.2621 462.66 -4282.4
## - Beans               1    4.2266 463.63 -4277.1
## - Cocoa_Butter        1    9.8788 469.28 -4246.5
## - cocoa_percent       1   10.1397 469.54 -4245.1
## - num_of_ingredients  1   12.5085 471.91 -4232.3
## 
## Step:  AIC=-4300.12
## rating ~ cocoa_percent + num_of_ingredients + Beans + Sugar + 
##     Cocoa_Butter + Lecithin + Salt
## 
##                      Df Sum of Sq    RSS     AIC
## <none>                            459.44 -4300.1
## - Salt                1    1.2483 460.68 -4295.3
## - Sugar               1    2.8419 462.28 -4286.5
## - Lecithin            1    3.2992 462.73 -4284.0
## - Beans               1    8.1206 467.56 -4257.8
## - Cocoa_Butter        1    9.8864 469.32 -4248.3
## - cocoa_percent       1   10.2300 469.67 -4246.4
## - num_of_ingredients  1   12.5348 471.97 -4234.0
summary(model_backwards)
## 
## Call:
## lm(formula = rating ~ cocoa_percent + num_of_ingredients + Beans + 
##     Sugar + Cocoa_Butter + Lecithin + Salt, data = regression)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.98840 -0.27822  0.00224  0.26835  1.12359 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.694179   0.126509  29.201  < 2e-16 ***
## cocoa_percent      -0.011682   0.001559  -7.494 9.21e-14 ***
## num_of_ingredients -0.219632   0.026477  -8.295  < 2e-16 ***
## BeansTRUE           0.591068   0.088529   6.677 3.00e-11 ***
## SugarTRUE           0.219543   0.055584   3.950 8.04e-05 ***
## Cocoa_ButterTRUE    0.261936   0.035556   7.367 2.36e-13 ***
## LecithinTRUE        0.172651   0.040570   4.256 2.16e-05 ***
## SaltTRUE            0.247095   0.094393   2.618   0.0089 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4268 on 2522 degrees of freedom
## Multiple R-squared:  0.08393,    Adjusted R-squared:  0.08139 
## F-statistic: 33.01 on 7 and 2522 DF,  p-value: < 2.2e-16

We can conclude from the backward linear model, that if we remove vanilla and sweetener (as AIC score is the lowest), we can describe the rating better with the other ingredients, cocoa percent and number of ingredients. We can notice that all the variables of this new model are significant so this model is more accurate than the previous one shown above. We can conclude the same conclusions as the original model. (Bi of each variable are similar to the original model; Small R2 and F-statistic that shows Bi0)

Conclusion

Throughout our research, we organized our raw chocolate data, visualized certain variables, and conducted tests using hypotheses and linear regression. Our goal was supply our Israeli manufacturer with 5 unique chocolate bars that are worth importing.

We assumed and then proved using the methods stated above: The best chocolate first begins with cocoa that was harvested from South America. Next, the cocoa is manufactured into chocolate bars in North America(specifically in Canada and U.S.A.).In addition, these bars need to contain 2-3 ingredients. These ingredients include Beans,Sugar, and sometimes Cocoa Butter. Chocolate bars that fit in to these categories yielded not only high ratings, but also micro-samples containing the highest rated chocolate bars out of our larger data sample.

We found that it is difficult to choose a specific bar for each category, therefore our manufacturer simply needs to choose from these micro-samples. From our research, we can confidently conclude that whichever bars that our supplier will choose will be the right choice.

Overall, we are very satisfied with these results and hope to work with this Israeli manufacturer in the future.